Big data collection in pharmaceutical manufacturing and its use for product quality predictions

Advances in data science and digitalization are transforming the world, and the pharmaceutical industry is no exception. Multiple sensor-equipped manufacturing processes and laboratory analysis are the main sources of primary data, which have been utilized for the presented dataset of 1005 actual production batches of selected medicine. This dataset includes incoming raw material quality results, compression process time series and final product quality results for the selected product. The data is highly valuable for it provides an insight into every 10 seconds of the process trajectory for 1005 actual production batches along with product quality collected over several years. It therefore offers an opportunity to develop advanced analysis models and procedures which would lead to the omission of current conventional and time consuming laboratory testing. Benefits for both the industry and patient are obvious: reducing product lead times and costs of manufacture.

www.nature.com/scientificdata www.nature.com/scientificdata/ The data presented in this study were collected from three main databases: "Laboratory Sample Manager Database", production database, and process time series database connected with tablet press SQL database.
Incoming materials have a significant impact on final product quality, and when combined with process data, additional insight into the product may be obtained. The dataset presented in detail below offers an opportunity for researchers to find alternative ways to determine product quality.

Methods
Due to the sensitivity of industry data, all data descriptors (i.e., batch number, product code, excipient batches, etc.) were anonymized.
Dataset scope. The product or product family in the scope of the research has several product sub-families, which are defined by product code. Product sub-families differ in strength and manufacturing batch size. There are four different strengths and nine different batch sizes present in the research dataset. Products of different strengths within the scope have proportional or semi-proportional formulations and only differ in the weight of the final tablet, keeping formulation ratios the same. In order to account for the differences between product sub-families, categorical data are also included in the research dataset.
The data collected for the present research range from November 2018 to April 2021. The time interval exceeding one year ensures that seasonal variation, changes in incoming raw materials, the impact of operator shift work, holidays, and other common process and equipment variability, are all taken into account. It is thus safe to assume that the presented dataset is robust and representative of the selected product. Data sources. The primary data sources are laboratory analysis results of incoming raw materials (excipients and API), of the intermediate product (tablet cores), and of the final product. The analyses were performed by trained laboratory technicians specialized in corresponding test. Devices used for analysis ranged from HPLC (high-performance liquid chromatography), GC (gas chromatography), moisture analyzer and particle size analyzer to automatic tablet cores analyzer.
The second primary source of data are the tablet compression process time series. Time series output, such as tablet press speed, compaction force, fill depth, etc., is generated by tablet press sensors ( Table 2). Time series output is generated for every second of the process and is stored in the tablet press SQL database. From there, time series are uploaded to a server that allows for visualization or extraction of the data by domain experts. This data is semi-structured and requires cleaning and organizing before use.
Data collection methods. Before accessing and exporting securely the stored laboratory and process data, the so-called batch genealogy was performed. All laboratory and process data in the above-mentioned databases are stored using batch identifiers. In order to extract the relevant data from databases, it was necessary to determine the corresponding raw material batches that entered into each of the 1,005 final product batches included in this data descriptor study. Only after this initial information was known, did the process of data collection begin.
We exported the data by product material code (i.e., product sub-family), which groups all the batches that have been manufactured under that particular code. The export filter settings, therefore, included the time interval, product code, and laboratory analysis range.
The process time series export was more challenging compared to the laboratory data, due to the quantity of the data. The tablet compression process typically runs between 2 hours and 20 hours, depending on product sub-family (i.e., product code), which defines the batch size (i.e., the target number of tablets produced). The

Incoming materials
Excipients' characteristics Active pharmaceutical ingredient characteristics

Process time series
Intermediate product characteristics

Final product
Laboratory analysis Product quality Table 1. Overview of data sources for a pharmaceutical product.

Incoming raw materials
Active pharmaceutical ingredient, excipients  www.nature.com/scientificdata www.nature.com/scientificdata/ larger the batch size, the longer the process. We exported the data for every 10 seconds of the process and thus slightly reduced complexity, while still keeping all important process information.
The exported batch time series data provided datasets of several thousand rows (see details in the Data Records section). The main identifier of this type of data is a timestamp, which was unstructured and needed preprocessing due to different time formats present in the primary data.
preprocessing of laboratory data. The laboratory data includes the results from the incoming raw material analysis (independent variables), intermediate product quality (independent variables), and final product quality (dependent variables). We visualized every exported parameter (i.e., laboratory analysis results) to observe for any outliers.
Another aspect that needed to be evaluated was whether the data had the expected spread based on product and process knowledge. The following preprocessing steps were applied: • Batches with no final product quality results available were excluded from the dataset.
• Four different product "sub-families" have four different target tablet core weights, which would have affected further analysis. For this reason, a normalized parameter, weight relative standard deviation (RSD), was included instead. However, for the purpose of alternative future analysis approaches, the original weight data for every batch is kept in the provided dataset. • Every product code (sub-family) also has a different target thickness, diameter and hardness. We therefore decided to prepare a new parameter normalized with respect to tablet dimensions and shape -tensile strength (Eq. 1) 12,13 . Unlike hardness, tensile strength is a normalized parameter comparable across product sub-families. Since all product sub-families have the same cylindrical tablet shape, no alterations to Eq. 1 are needed. A new parameter was created by applying the following equation for every batch: Tensile stregth F t d : 2 ; (1) σ π = * * * where F represents tablet hardness in Newton (N), t is tablet thickness in millimeters (mm) and d represents tablet diameter in millimeters. The average hardness value and the maximum thickness and diameter for both tablet cores and film-coated tablets were considered in the calculation. • Product quality parameters included in the dataset are final product impurities, residual solvents and drug release results.
preprocessing of time series data. We performed visualization on time series to evaluate the quality of data, any unusual events, and any requirements for special preprocessing or complete removal of certain batches. Among all the parameters included in the time series dataset, tablet press speed and the number of rejected tablets, counted by the press, were the most descriptive of batch dynamics. The first parameter indicates the time when batch tablet production started and whether there were any unusual process interruptions. The second parameter selected was the number of rejected tablets, because it helps to understand whether tablet press speed reaching 0 is due to the process being finished or due to another start-up or more challenging issues occurring during the process. If the number of rejected tablets does not increase, then the process is finished. Furthermore, the trajectory of rejected tablets in a batch cannot fluctuate; it can only increase with time. The following preprocessing and cleaning steps were taken: • The process time series were quality checked by applying the above explained visualization of the two parameters for every single batch included in this dataset (1,005 batches in total). • We have observed that some batches were stopped because of the weekends or bank holidays. Due to this, a considerable part of the time series displays the value of 0, as the tablet press was paused. When dealing with this data, this particular characteristic needs to be considered, as it could hinder future analysis. These parts have been deliberately kept as part of the time series dataset, because leaving blended powdered material to rest for prolonged periods of time could potentially impact the quality of compressed tablets. • The tablet press SQL database includes unstructured data, and the transfer of these into a readily available database (iHistorian) resulted in a mixed, 12-and 24-hour time structure. Standardization of the time format was performed. • Visualization enabled us to detect a drop in the rejected tablet parameter for some batches. This required a detailed data investigation for those particular batches. We discovered that this issue had occurred with batches that spread over two or more days because of the incorrect date structure in some batches. Corrections were made for all batches where the issue was detected.  www.nature.com/scientificdata www.nature.com/scientificdata/ • Some time series datasets had a so-called "gap", meaning that for a period of several minutes, there were no data available, i.e., empty cells. This occurs when the tablet press is shut down for whatever reason, e.g., weekend, shift change, calibration of the tablet press, calibration of the automatic weight and hardness control system, etc. Since these gaps of missing data have no value for analysis, they were excluded. • Once all the above steps were applied and the data has been cleaned and the time structure corrected, another visualization cycle was performed to check whether all the preprocessing steps have worked.  www.nature.com/scientificdata www.nature.com/scientificdata/ Only the correctly preprocessed, high quality time series data were kept and are presented in this research.

List of parameters in laboratory data file
time series data reduction. The time series presented a vast amount of data compared to laboratory result entries for each batch. The aim was to create new parameters for each batch that would best describe the original time series data for a particular batch.
These new parameters were then combined with the laboratory data linked with the same product batch. Such data manipulation allows for further in-depth analysis, process understanding, and product quality predictions. The time series data were provided intact for this data descriptor paper as well as in a reduced format; as new attribute vectors that were tailored separately for each process parameter based on expert knowledge.
The following feature extraction steps were applied and new attributes created for every batch time series dataset: • Average tablet press speed: the average speed of the tablet press machine, excluding the tablet press values of 0 due to known longer interruptions, such as weekends and holidays, which would mask the true process characteristics. • Number of tablet press speed changes: tablet press speed is generally set in the beginning of the process and usually only changes or drops to 0 if issues with tablet compression arise. The best process is a constant process without interruptions or changes to the parameters. This attribute was normalized by batch size, because larger (and therefore longer) batches will naturally have more interruptions than smaller ones. • Tablet press speed of 0: the total time the tablet press speed was 0. This parameter, too, was normalized by batch size for the same reason as explained above. It was necessary to consider if a batch was run or paused during the weekends and holidays. For this reason, we created an additional categorical attribute to note which batches had a "weekend run". If the tablet press was stopped during a weekend or holiday, this time was subtracted from the total time the tablet press speed was 0. • Total number of rejected tablets: this parameter is cumulative and tells us for every moment of the process how many tablets have been rejected up to a certain point. It was normalized with batch size. • Number of rejected tablets during start-up: it gives an indication how much time and effort was needed to prepare tablet press parameters for a particular blend. The higher the amount of rejected tablets, the more effort was needed due to more challenging material properties or potentially an inexperienced operator. Either way, both have a potential impact on tablet quality. Normalization is not needed, because the size of the batch does not have an impact on the set-up of the tablet press. Tablet press speed: it indicates when the process is running and when it has stopped, if there were many changes to this parameter or many stoppages, the material handling is challenging, which may indicate suboptimal product quality.
fom rpm Filling device speed in rotations per minute: similar to tablet press speed. If the process is running, so is the filling device. This parameter generally does not change and is only set during the start-up. If many changes (during the start-up) are observed, this again indicates potential difficulties with incoming material handling. main_comp kN Main compression force -mean value: the more constant this parameter is, the more homogeneous is the incoming material blend in terms of physical properties.
tbl_fill mm Tablet fill depth: defines the volume of filled blended material to be compressed. If flow properties of material are poor, this parameter will vary throughout the batch and will consequently impact tablet hardness and weight.
SREL % Main compression force -standard relative deviation: this parameter is calculated by the tablet press itself by using main compression force mean values. It gives an indication of how uniform the tablets compacted are.

pre_comp kN
Pre-compression force -mean value: if pre-compression force is used for tablet compaction, this parameter will be greater than 1 and will give a similar indication as main compression force. It is not readily used for the product in the scope.
produced tablets Good production: all acceptable tablets that have been produced at that particular timestamp.

waste tablets
Bad production: tablets that do not pass the set tablet press parameters (i.e., max % deviation from the set main compression force -mean value). This is also a cumulative parameter and gives information about all rejected tablets at that particular time. stiffness N Bottom punch stiffness in Newton: when the limit is reached, the press is stopped with suitable diagnosis. An equipment parameter. ejection N Maximum tablet ejection force: if this parameter rises, the tablet ejection friction is higher, which could mean that some minor sticking of the tablet has occurred on the tablet tooling. www.nature.com/scientificdata www.nature.com/scientificdata/ • Average filling device speed: the filling device speed average excluding 0 values. • Filling device speed changes: it is very uncommon to change the filling device speed during the compression process itself, so it makes more sense to consider only those changes during tablet press set-up, when the "production" parameter is 0, meaning that no good tablets have yet been made and we are therefore in the start-up phase of the process. • Average SREL start-up: the average standard relative deviation of the main compression force (SREL) during compression start-up, excluding values where the tablet press is not running (i.e., the tablet press speed or the filling device speed is 0). • Average SREL: the average SREL during a compression run, excluding values where the tablet press is not running (i.e., the tablet press speed or the filling device speed is 0). • SREL max: the maximum value of SREL during a compression run. It is essential to eliminate any excessive values, i.e., anything above 15, because such values are unrealistic for the product based on expert product knowledge and only appear in the beginning of the tablet press run due to the main compression force transitioning from 0 to the target value. The difference is naturally large, which shows in the SREL parameter. • Simple statistical methods were applied to other process time series parameters for the compression run as well as the start-up phase of the compression process: min, max, standard deviation, mean, median.
Based on batch sizes, normalization factors were calculated, which are included in a separate file within the enclosed dataset.

Data records
The data records are available at https://doi.org/10.6084/m9.figshare.c.5645578.v1 14 . The data is divided into the laboratory data ("Laboratory.csv"), the time series data ("Process time series"), the extracted features of time series data ("Process.csv"), and the normalization factors ("Normalization.csv"). Each of these is presented below.
Laboratory data. This file combines genealogy information, the incoming raw materials analysis, the intermediate product analysis and the final product analysis. There are altogether 1,005 rows of data, each row presenting one final product batch ( Table 3). The rows are identified by final product batch numbers; the batches can   www.nature.com/scientificdata www.nature.com/scientificdata/ be grouped into the so-called product sub-families by product codes. Laboratory data parameters are presented and explained in Table 4.
process time series data. The time series data are arranged in one file per product code (i.e., product sub-family). Each product code combines all final product batches manufactured in the selected period. The process time series data includes the most relevant tablet compression process parameters based on product history and expert knowledge collected for every 10 seconds of the process.
The following process parameters were considered: tablet press speed, filling device speed, main and precompression force mean value, tablet fill depth, standard relative deviation of main compression force (SREL), good and bad production, cylindrical height for main and pre-compression, bottom punch stiffness and ejection force. The effect of these parameters on tablet characteristics is summarized in Table 5, along with the shortened naming convention used in the uploaded data files as well as units for each parameter. Each time series file grouped by product code contains the same process parameters with the same naming convention.
It should be noted that all potential variations of the process parameters described in the table below led to a process within specification limits for all batches included in this research dataset. The variation of process parameters and product characteristics within the specification limits is normal and expected. extracted features of time series data. The time series data needed reduction and creation of new attributes before they could be readily used for the prediction analysis selected for the product. An example of attribute preparation is detailed in the Methods section, Preprocessing of time series data. A list of new attributes replacing the whole time series per each batch is detailed in Table 6.   Table 7. Statistical analysis and performance qualification.

List of parameters in laboratory data file (units of measure) Average
www.nature.com/scientificdata www.nature.com/scientificdata/ Based on batch sizes, normalization factors were calculated, which are included in a separate file within the enclosed dataset.

technical Validation
The laboratory data included in the current study are generated by trained laboratory technicians. Each analysis is performed following approved operating procedures for a particular test. The equipment used for the analysis is qualified by the supplier and the engineering team before release for use in the company. The maintenance of these analytical instruments demands periodic services by an external certified company and regular calibration before use for analysis. Analytical instruments need to comply with strict international and internal industry standards and are subject to regular audits, which confirm the robustness and reliability of these devices. Every analytical result generated as described above is then transcribed into a dedicated database by a laboratory technician that performed the analysis. This entry needs to be verified and signed off by a second person in order for it to be uploaded into the database. The entire process follows good laboratory, manufacturing and documentation practice as defined by pharmaceutical industry standards (international and internal), resulting in data we can trust.
The laboratory data for 1,005 batches have been collected from databases and verified further to check for potential outliers and observe whether the scatter of the data is expected for the selected product. We applied statistical analysis to all collected laboratory parameters. Furthermore, the process capability index was calculated for measurements where laboratory data follows normal distribution and the parameter includes both the upper and the lower specification limit. (Note: quality parameters often only have either the upper or the lower specification limit, not both.) Overall process capability (Ppk) tells us whether the process is under control and within the limits (i.e., upper and lower specification limits). Ppk above 1 indicates a well-controlled process and is expected for the selected product (e.g., Figs. 1, 2). The laboratory parameters that are applicable for Ppk calculation are dissolution average (Fig. 1) and dissolution min (Fig. 2), which both demonstrate normal distribution. This is confirmed by a comparison of the actual process data (blue histogram bars) with a normal distribution curve (solid red curve on the graph). Ppk calculation applied to these data is as follows:   where USL stands for the upper specification limit, LSL the lower specification limit, X for the average value and σ for standard deviation. Statistical evaluation results (average, standard deviation, minimum and maximum values), which are gathered in Table 7 and indicate the data, follow the expected spread without extreme outliers from the average. All measurements are within the specification limits for each parameter. The greatest variation occurs for API total impurities, residual solvent and total product impurities (RSD is 50.1%, 91.1% and 71.3% respectively). This variation is acceptable due to the small absolute spread of the data. Furthermore, the upper specification limit for these three parameters is double or higher the maximum value, which further confirms the acceptability of results given.
The time series data included in the current study are generated by the tablet press machine. Tablet presses are installed and qualified by the supplier. Before they are released for use in production, an internal engineering team performs extensive qualifications and verifies all operational functionalities that need to comply with the predefined user requirements. Tablet presses are then approved for use in production and are regularly serviced, re-qualified and calibrated with the frequency defined by international pharmaceutical standards for production equipment. Software and server functionalities are separately verified on a regular basis by dedicated IT teams. Tablet presses are also subject to regular audits, which demand the highest industry standards for every production equipment. The data generated by the tablet press are thus a reliable source of information about process dynamics.
The time series were extracted from a server linked with the tablet press database and visualized. In total, 1,005 visual inspections were carried out as part of initial data validation. The batches where process profiles did not follow the expected trend were inspected in detail and preprocessed if needed. An example of a preprocessing requirement due to unstructured time may be seen in Figs. 3 and 4.
Besides unstructured time for some batches (e.g., Fig. 3), there was a so-called "data gap" observed. These are short intervals of several minutes, where, for some batches, no data is logged. The reason for this is loss of tablet press power (i.e., shut down). These are very rare events and were removed because they do not bring any added value to the process understanding, but could potentially impact further data analysis.
Time format correction led to time series profile, which conforms with the nature of the manufacture process in question (Fig. 4).  www.nature.com/scientificdata www.nature.com/scientificdata/ Missing data. The laboratory data that are missing for some batches are active pharmaceutical ingredient (API) analysis results: total impurities, impurity L, water content and particle size. These are missing for a particular source of API material, i.e., all missing data belong to the same API code. It is recommended to either attempt further analysis with parameters that have missing data or, in case they are not relevant for the type of analysis, the parameter may be dropped altogether.

Usage Notes
Pharmaceutical product quality analysis is mandatory before any batch of medicine is released to the market 11,15 . This process is currently predominantly laboratory based and thus very time consuming for most products across the industry. Each product batch has a wealth of data from incoming raw materials to intermediate product analysis and detailed process time series for every equipment sensor 1 . These data have been described and provided for in the present data descriptor publication for the selected pharmaceutical product. We recommend to use the data further for the prediction of final product quality (for parameters in Table 4, last 6 rows).
As described in the example analysis dataset, we have used time series datasets to extract parameters based on expert knowledge and the impact the tablet compression process has on the product in question (Table 6). If this approach does not lead to reliable enough prediction models, it would be better to use whole time series datasets and use a deep learning methodology in order to find attributes that describe the selected product quality better.

code availability
Python code is available at figshare within the same collection as the dataset: 14 https://doi.org/10.6084/ m9.figshare.c.5645578.v1. The code is available for preprocessing, visualization and attribute extraction.